Comparison

In this section, we compare cpprb with other replay buffer implementations.

Important Notice

Except for cpprb and DeepMind/Reverb, the replay buffers compared here are only one component of a larger reinforcement learning ecosystem. These libraries focus on reinforcement learning as a whole rather than on providing the best possible replay buffers.

Our motivation is to provide powerful replay buffers to researchers and developers who not only use existing networks and/or algorithms but also create brand-new ones.

Here, we would like to show that cpprb is sufficiently functional and efficient compared with the others.

1 OpenAI Baselines

OpenAI Baselines is a set of baseline implementations of reinforcement learning algorithms developed by OpenAI.

The source code is published under the MIT license.

Ordinary and prioritized experience replay are implemented with the ReplayBuffer and PrioritizedReplayBuffer classes, respectively. Using these classes directly is (probably) not the intended usage, but you can import them like this:

from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer

ReplayBuffer is initialized with a size parameter for the replay buffer size. Additionally, PrioritizedReplayBuffer requires an alpha parameter for the degree of prioritization. These parameters don't have default values, so you need to specify them.

buffer_size = int(1e6)
alpha = 0.6

rb = ReplayBuffer(buffer_size)
prb = PrioritizedReplayBuffer(buffer_size,alpha)

A transition is stored into the replay buffer by calling ReplayBuffer.add(self,obs_t,action,reward,obs_tp1,done).

For PrioritizedReplayBuffer, the maximum priority at that time is automatically used for a newly added transition.

These replay buffers are ring buffers, so the oldest transition is overwritten by a new one once the buffer becomes full.

obs_t = [0, 0, 0]
action = [1]
reward = 0.5
obs_tp1 = [1, 1, 1]
done = 0.0

rb.add(obs_t,action,reward,obs_tp1,done)
prb.add(obs_t,action,reward,obs_tp1,done) # Store with max. priority

Stored transitions can be sampled by calling ReplayBuffer.sample(self,batch_size) or PrioritizedReplayBuffer.sample(self,batch_size,beta).

ReplayBuffer returns a tuple of batched transitions. PrioritizedReplayBuffer additionally returns weights and indexes.

batch_size = 32
beta = 0.4

obs_batch, act_batch, rew_batch, next_obs_batch, done_mask = rb.sample(batch_size)
obs_batch, act_batch, rew_batch, next_obs_batch, done_mask, weights, idxes = prb.sample(batch_size)

Priorities can be updated by calling PrioritizedReplayBuffer.update_priorities(self,idxes,priorities).

priorities = [0.5] * batch_size  # e.g. new priorities for the sampled transitions (illustrative values)
prb.update_priorities(idxes,priorities)

Internally, these replay buffers use a Python list for storage, so the memory usage increases gradually until the buffer becomes full.

2 Ray RLlib

RLlib is a reinforcement learning library built on the distributed framework Ray.

The source code is published under the Apache-2.0 license.

Ordinary and prioritized experience replay are implemented with ReplayBuffer and PrioritizedReplayBuffer classes, respectively.

These classes are decorated with @DeveloperAPI, which means they are intended to be used by developers building custom algorithms.

from ray.rllib.execution.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer

These replay buffer classes are initialized in the same way as OpenAI Baselines':

buffer_size = int(1e6)
alpha = 0.6

rb = ReplayBuffer(buffer_size)
prb = PrioritizedReplayBuffer(buffer_size,alpha)

A transition is stored by calling ReplayBuffer.add(self,item,weight) or PrioritizedReplayBuffer.add(self,item,weight). The item is an instance of ray.rllib.policy.sample_batch.SampleBatch. (The API was changed in ray-0.8.7.)

The SampleBatch class can hold any kind of values with a batch dimension; however, only a single transition is allowed per addition. (Passing a SampleBatch with multiple transitions to the add() method will probably produce an unintended bug.) The keys of a SampleBatch can be defined freely by the user.

In RLlib, PrioritizedReplayBuffer takes the weight parameter to specify the priority at insertion time. To unify the API, ReplayBuffer also requires the weight parameter, even though it is not used at all. Moreover, weight has no default value (such as None), so you need to pass something explicitly.

obs_t = [0, 0, 0]
action = [1]
reward = 0.5
obs_tp1 = [1, 1, 1]
done = 0.0
weight = 0.5

from ray.rllib.policy.sample_batch import SampleBatch

# For addition, SampleBatch must be initialized with a set of `list`s
# having only a single element each.
rb.add(SampleBatch(obs=[obs_t],
                   action=[action],
                   reward=[reward],
                   new_obs=[obs_tp1],
                   done=[done]),None)
prb.add(SampleBatch(obs=[obs_t],
                    action=[action],
                    reward=[reward],
                    new_obs=[obs_tp1],
                    done=[done]),weight)

Like OpenAI Baselines, stored transitions can be sampled by calling ReplayBuffer.sample(self,batch_size) or PrioritizedReplayBuffer.sample(self,batch_size,beta).

ReplayBuffer returns a SampleBatch of batched transitions. (The API was changed in ray-0.8.7.) PrioritizedReplayBuffer additionally returns weights and batch_indexes in the SampleBatch.

batch_size = 32
beta = 0.4

sample = rb.sample(batch_size)
obs_batch, act_batch, rew_batch, next_obs_batch, done_mask = sample["obs"], sample["action"], sample["reward"], sample["new_obs"], sample["done"]

sample = prb.sample(batch_size,beta)
obs_batch, act_batch, rew_batch, next_obs_batch, done_mask, weights, idxes = sample["obs"], sample["action"], sample["reward"], sample["new_obs"], sample["done"], sample["weights"], sample["batch_indexes"]

Priorities can also be updated by calling PrioritizedReplayBuffer.update_priorities(self,idxes,priorities).

priorities = [0.5] * batch_size  # e.g. new priorities for the sampled transitions (illustrative values)
prb.update_priorities(idxes,priorities)

Internally, these replay buffers use a Python list for storage, so the memory usage increases gradually until the buffer becomes full.

3 Chainer ChainerRL

ChainerRL is a deep reinforcement learning library based on the Chainer framework. Chainer (including ChainerRL) has already stopped active development, and its development team (Preferred Networks) has joined PyTorch development.

The source code is published under the MIT license.

Ordinary and prioritized experience replay are implemented with ReplayBuffer and PrioritizedReplayBuffer, respectively.

from chainerrl.replay_buffers import ReplayBuffer, PrioritizedReplayBuffer

ChainerRL has a slightly different API from OpenAI Baselines'.

ReplayBuffer is initialized with a capacity=None parameter for the buffer size and a num_steps=1 parameter for the Nstep configuration.

PrioritizedReplayBuffer can additionally take the parameters alpha=0.6, beta0=0.4, betasteps=2e5, eps=0.01, normalize_by_max=True, error_min=0, and error_max=1.

In ChainerRL, the beta parameter (the weight-correction exponent) starts from beta0, increases linearly over betasteps iterations, and stays at 1.0 afterwards.

buffer_size = int(1e6)
alpha = 0.6

rb = ReplayBuffer(buffer_size)
prb = PrioritizedReplayBuffer(buffer_size,alpha)
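
For reference, the prioritized buffer can also be constructed with the beta schedule spelled out explicitly. This is only a sketch: the keyword names are the ones listed above, and the values shown are purely illustrative.

# Illustrative only: same constructor, with the beta-schedule parameters
# (beta0, betasteps) and error clipping bounds passed explicitly.
prb_explicit = PrioritizedReplayBuffer(capacity=buffer_size,
                                       alpha=0.6,
                                       beta0=0.4,       # initial beta
                                       betasteps=2e5,   # iterations until beta reaches 1.0
                                       eps=0.01,
                                       normalize_by_max=True,
                                       error_min=0,
                                       error_max=1)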

A transition is stored by calling ReplayBuffer.append(self,state,action,reward,next_state=None,next_action=None,is_state_terminal=False,env_id=0,**kwargs). Additional keyword arguments are stored too, so you can keep any custom environment values. By specifying env_id, multiple trajectories can be tracked in the Nstep configuration.

obs_t = [0, 0, 0]
action = [1]
reward = 0.5
obs_tp1 = [1, 1, 1]
done = False

rb.append(obs_t,action,reward,obs_tp1,is_state_terminal=done)
prb.append(obs_t,action,reward,obs_tp1,is_state_terminal=done)

Stored transitions are sampled by calling ReplayBuffer.sample(self,num_experience) and PrioritizedReplayBuffer.sample(self,n).

Unlike other implementations, ChainerRL's replay buffers return unique (non-duplicated) transitions, so batch_size must not exceed the number of stored transitions. Furthermore, they return a Python list of transition dicts instead of a Python tuple of batched environment values.

batch_size = 32

# The sampled list of transition dicts needs additional processing (see below)
transitions = rb.sample(batch_size)
transitions_with_weight = prb.sample(batch_size)
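
As a rough sketch of the extra processing mentioned in the comment above, the sampled list of transition dicts could be rearranged into per-key batches. The key names used here are assumptions based on the append() signature above and should be verified against the dicts actually returned by sample().

import numpy as np

# Assumed keys, taken from the append() arguments above; verify them
# against the dicts actually returned by sample().
def to_batch(transitions, key):
    return np.asarray([t[key] for t in transitions])

obs_batch      = to_batch(transitions, "state")
act_batch      = to_batch(transitions, "action")
rew_batch      = to_batch(transitions, "reward")
next_obs_batch = to_batch(transitions, "next_state")
done_mask      = to_batch(transitions, "is_state_terminal")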

The indexes to update cannot be specified manually; instead, PrioritizedReplayBuffer memorizes the indexes of the most recently sampled transitions. (Without sampling first, the user cannot update priorities.)

priorities = [0.5] * batch_size  # e.g. new priorities for the last sampled batch (illustrative values)
prb.update_priorities(priorities)

Internally, these replay buffers use Python lists for storage, so the memory usage increases gradually until the buffer becomes full. In ChainerRL, the storage is not a single Python list but two lists, so that the oldest element can be popped in O(1) time.

4 DeepMind Reverb

Reverb is relatively new; it was released by DeepMind on 26th May 2020.

Reverb is a framework for experience replay like cpprb. By utilizing a server-client model, Reverb is mainly optimized for large-scale distributed reinforcement learning.

The source code is published under the Apache-2.0 license.

Currently (28th June 2020), Reverb officially states that it is still not production level and requires a development version of TensorFlow (i.e. tf-nightly 2.3.0.dev20200604).

Ordinary and prioritized experience replay are constructed by passing reverb.selectors.Uniform() or reverb.selectors.Prioritized(alpha), respectively, as the sampler argument of the reverb.Table constructor.

The following sample code constructs a server with two replay buffers listening on port 8000.

import reverb

buffer_size = int(1e6)
alpha = 0.6

server = reverb.Server(tables=[reverb.Table(name="ReplayBuffer",
                                            sampler=reverb.selectors.Uniform(),
                                            remover=reverb.selectors.Fifo(),
                                            rate_limiter=reverb.rate_limiters.MinSize(1),
                                            max_size=buffer_size),
                               reverb.Table(name="PrioritizedReplayBuffer",
                                            sampler=reverb.selectors.Prioritized(alpha),
                                            remover=reverb.selectors.Fifo(),
                                            rate_limiter=reverb.rate_limiters.MinSize(1),
                                            max_size=buffer_size)],
                       port=8000)

By changing the selector and remover, we can use different algorithms for sampling and overwriting, respectively (a combined example follows the list below).

The supported algorithms implemented in reverb.selectors are the following:

  • Uniform: Select uniformly.
  • Prioritized: Select proportional to stored priorities.
  • Fifo: Select oldest data.
  • Lifo: Select newest data.
  • MinHeap: Select data with lowest priority.
  • MaxHeap: Select data with highest priority.
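
For instance, the following sketch combines two of the selectors above into a table that samples the newest item and evicts the lowest-priority one when full. The table name is illustrative only.

# Illustrative combination: sample the newest data, remove the item
# with the lowest priority once the table is full.
lifo_table = reverb.Table(name="LifoBuffer",
                          sampler=reverb.selectors.Lifo(),
                          remover=reverb.selectors.MinHeap(),
                          rate_limiter=reverb.rate_limiters.MinSize(1),
                          max_size=buffer_size)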

There are 3 ways to store a transition.

The first method uses reverb.Client.insert. Not only the prioritized replay buffer but also the ordinary replay buffer requires a priority, even though it is not used.

import reverb

client = reverb.Client(f"localhost:{server.port}")

obs_t = [0, 0, 0]
action = [1]
reward = [0.5]
obs_tp1 = [1, 1, 1]
done = [0]

client.insert([obs_t,action,reward,obs_tp1,done],priorities={"ReplayBuffer":1.0})
client.insert([obs_t,action,reward,obs_tp1,done],priorities={"PrioritizedReplayBuffer":1.0})

The second method uses reverb.Client.writer, which is also used internally by reverb.Client.insert. This method can be more efficient because you can flush multiple items together by calling reverb.Writer.close instead of flushing them one by one.

import reverb

client = reverb.Client(f"localhost:{server.port}")

obs_t = [0, 0, 0]
action = [1]
reward = [0.5]
obs_tp1 = [1, 1, 1]
done = [0]

with client.writer(max_sequence_length=1) as writer:
    writer.append([obs_t,action,reward,obs_tp1,done])
    writer.create_item(table="ReplayBuffer",num_timesteps=1,priority=1.0)

    writer.append([obs_t,action,reward,obs_tp1,done])
    writer.create_item(table="PrioritizedReplayBuffer",num_timesteps=1,priority=1.0)

The last method uses reverb.TFClient.insert. This class is designed to be used inside a TensorFlow graph.

import tensorflow as tf
import reverb

tf_client = reverb.TFClient(f"localhost:{server.port}")

obs_t = tf.constant([0, 0, 0])
action = tf.constant([1])
reward = tf.constant([0.5])
obs_tp1 = tf.constant([1, 1, 1])
done = tf.constant([0])

tf_client.insert([obs_t,action,reward,obs_tp1,done],
                 tables=tf.constant(["ReplayBuffer"]),
                 priorities=tf.constant([1.0],dtype=tf.float64))
tf_client.insert([obs_t,action,reward,obs_tp1,done],
                 tables=tf.constant(["PrioritizedReplayBuffer"]),
                 priorities=tf.constant([1.0],dtype=tf.float64))

The tables parameter must be a tf.Tensor of str with rank 1, and the priorities parameter must be a tf.Tensor of float64 with rank 1. The lengths of tables and priorities must match.
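
Because both tensors are rank 1 and must have the same length, a single call can, for example, write the same item to both tables at once. This is only a sketch based on the constraint just described.

# One call, two tables: `tables` and `priorities` have matching length 2.
tf_client.insert([obs_t,action,reward,obs_tp1,done],
                 tables=tf.constant(["ReplayBuffer","PrioritizedReplayBuffer"]),
                 priorities=tf.constant([1.0,1.0],dtype=tf.float64))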

Transitions can also be sampled in 3 ways.

The first method utilizes reverb.Client.sample, which returns a generator of reverb.replay_sample.ReplaySample. As far as we can tell, the beta parameter is not supported and weights are not calculated for prioritized experience replay.

batch_size = 32

transitions = client.sample("ReplayBuffer",num_samples=batch_size)
transitions_with_priority = client.sample("PrioritizedReplayBuffer",num_samples=batch_size)
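
Following the priority-update example later in this section, each yielded item can be treated as a ReplaySample; a minimal sketch of unpacking the stored values might look like this (the data and info field names are assumed from ReplaySample).

# Unpack each sampled item; `data` is assumed to hold the stored values and
# `info` the metadata (such as the key used for priority updates).
for t in client.sample("ReplayBuffer", num_samples=batch_size):
    obs_t, action, reward, obs_tp1, done = t.data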

The second method uses reverb.TFClient.sample, which does not support batch sampling.

transition = tf_client.sample("ReplayBuffer",
                              [tf.float64,tf.float64,tf.float64,tf.float64,tf.float64])
transition_priority = tf_client.sample("PrioritizedReplayBuffer",
                                       [tf.float64,tf.float64,tf.float64,tf.float64,tf.float64])

The last method is completely different from the others: it calls reverb.TFClient.dataset, which returns a reverb.ReplayDataset derived from tf.data.Dataset.

Once the ReplayDataset is created, it can be used as a generator and automatically fetches transitions from its replay buffer at the proper timing.

dataset = tf_client.dataset("ReplayBuffer",
                            [tf.float64,tf.float64,tf.float64,tf.float64,tf.float64],
                            [4,1,1,4,1])
dataset_priority = tf_client.dataset("PrioritizedReplayBuffer",
                                     [tf.float64,tf.float64,tf.float64,tf.float64,tf.float64],
                                     [4,1,1,4,1])
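
A minimal usage sketch, assuming each dataset element is a ReplaySample with info and data fields:

# Iterate the dataset like any other tf.data.Dataset; each element is assumed
# to be a ReplaySample carrying the transition in `data`.
for sample in dataset.take(batch_size):
    obs_t, action, reward, obs_tp1, done = sample.data
    key = sample.info.key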

Priorities can be updated by reverb.Client.mutate_priorities or reverb.TFClient.update_priorities. Unlike other implementations, the key is not a sequential integer but a hash, so the key must be taken from sampled items by accessing ReplaySample.info.key.

for t in transitions_with_priority:
    client.mutate_priorities("PrioritizedReplayBuffer",updates={t.info.key: 0.5})

tf_client.update_priorities("PrioritizedReplayBuffer",
                            transition_priority.info.key,
                            priorities=tf.constant([0.5],dtype=tf.float64))

5 Other Implementations

There are also some other replay buffer implementations which we could not review in depth. In the future, we would like to investigate these implementations and compare them with cpprb.

TF-Agents
TensorFlow official reinforcement learning library
SEED RL
Scalable and Efficient Deep-RL
TensorFlow Model Garden
TensorFlow example implementation